Raft's Safety, Fault-Tolerance, and Availability Protocols

Let's learn how Raft ensures safety, handles leader and followers' crashes, and maintains availability.

Safety#

The previous lessons discussed how Raft selects leaders and replicates log entries. Still, additional mechanisms are needed to guarantee that every state machine executes the same commands in the same order. To see why this is the case, take an example of a follower that misses several log entries while the leader commits them. Such a follower can become the new leader and can overwrite the committed entries with new ones, resulting in different state machines executing different sequences of commands. The following slides show such a scenario:

Slide 1: S5 is a follower with multiple missing entries.

Slide 2: S5 can get elected as the new leader.

Slide 3: S5 can overwrite all the committed entries with the entries from its log. So, this is an issue!

To address this issue, the Raft algorithm restricts which servers can be elected as leaders to ensure that the leader for any given term contains all the entries committed in previous terms.

Election restriction#

In leader-based consensus algorithms, the leader is responsible for eventually storing all committed log entries. However, some algorithms allow a leader to be elected without initially having all committed entries. These algorithms require additional mechanisms to identify and transmit missing entries to the new leader during or after the election process, leading to increased complexity.

The following table lists the two election restrictions imposed by Raft, along with their rationale.

| Restriction | Rationale |
| --- | --- |
| Raft ensures that each new leader contains all committed entries from previous terms at the moment of its election, so there is no need to transfer them afterward. | This ensures that log entries flow in only one direction, from leaders to followers, and that leaders never overwrite existing entries in their own logs. |
| Raft prevents a candidate from winning an election unless its log contains all committed entries. It enforces this restriction through the voting process: a voter denies its vote to any candidate whose log is less up-to-date than its own. | As the voting process dictates, a candidate must contact a majority of the cluster to become leader, and any majority of the cluster contains at least one node that holds the latest committed entry. |
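The quorum-intersection argument in the second rationale can be checked exhaustively for a small cluster. The following sketch is illustrative only (the node names and sizes are assumptions, not from any Raft implementation); it verifies that any two majorities of a five-node cluster share at least one node:

```python
# Any set of servers that stored a committed entry (a majority) must
# overlap with any set of servers that elected a new leader (also a
# majority), so at least one voter holds the latest committed entry.
from itertools import combinations

nodes = {"S1", "S2", "S3", "S4", "S5"}
majority = len(nodes) // 2 + 1  # 3 out of 5

for commit_set in combinations(sorted(nodes), majority):
    for vote_set in combinations(sorted(nodes), majority):
        # Every pair of majorities intersects in at least one node.
        assert set(commit_set) & set(vote_set)

print("every pair of majorities overlaps")
```

The same pigeonhole argument holds for any cluster size: two subsets of size ⌈(n+1)/2⌉ cannot be disjoint in a set of n nodes.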


The `RequestVote` RPC enforces this restriction in Raft by including information about the candidate's log.

Point to ponder

Question

What could be Raft’s criteria for the log to be considered up-to-date?

Answer

Raft compares the index and the term of the last entries in the logs to determine which of the two logs is more up-to-date. If the logs have last entries with different terms, the log with the later term is more up-to-date. The longer log is considered more up-to-date if the logs end with the same term.
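The two-step comparison described in this answer can be sketched as a small function. The function name and parameters below are illustrative, not taken from any particular Raft implementation:

```python
# Sketch of Raft's "at least as up-to-date" log comparison, used by a
# voter to decide whether to grant its vote to a candidate.

def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up-to-date
    as the voter's log, per Raft's two-step comparison."""
    if cand_last_term != voter_last_term:
        # Different last terms: the later term wins, regardless of length.
        return cand_last_term > voter_last_term
    # Same last term: the longer (or equally long) log is up-to-date.
    return cand_last_index >= voter_last_index

# A voter whose log ends at (term=2, index=5) denies a candidate whose
# log ends at (term=1, index=9), despite the candidate's longer log.
print(candidate_log_is_up_to_date(1, 9, 2, 5))  # False
print(candidate_log_is_up_to_date(2, 5, 2, 5))  # True
```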

Committing entries from previous terms#

The leader of a Raft cluster knows that an entry from its current term is committed once it has been stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating that entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of the servers.

Before discussing Raft’s approach to committing log entries from previous terms, let’s examine how an old log entry stored on a majority of servers could still be overwritten by a future leader. This hypothetical scenario is illustrated below:

Slide 1: S1 (the leader) partially replicates the log entry at index 2.

Slide 2: S1 crashes, and S5 gets elected as the leader with votes from S3, S4, and itself. It accepts a different entry at log index 2 from term 3.

Slide 3: S1 gets elected again after S5 crashes and continues replicating the log entry from term 2 on most servers, but the entry has not been committed yet.

Slide 4 (the issue): If S1 crashes again, S5 might get elected as leader once more (with the support of S2, S3, and S4) and replace the partially committed entry with its own entry from term 3.

Slide 5 (the normal case): S5 cannot win an election if S1 replicates an entry from its current term on a majority of the servers before crashing, because at that point the log's whole history is committed on a majority of the servers.

Rule to commit entries from previous terms: To counter this issue (depicted in the fourth slide above), Raft never commits log entries from previous terms by counting replicas. Only entries from the leader's current term are committed by counting replicas. Once an entry from the current term has been committed this way, all preceding entries are indirectly committed due to the Log Matching Property, which ensures the log is consistent up to that point across all servers.
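This commit rule can be sketched as the check a leader performs before advancing its commit index. The function and state names below are illustrative assumptions, not from any particular implementation:

```python
# Sketch of the leader's commit rule: advance the commit index only to
# an index N that (a) is replicated on a majority and (b) holds an
# entry from the leader's *current* term.

def advance_commit_index(log, current_term, commit_index, match_index,
                         cluster_size):
    """log: list of (term, command), 1-indexed conceptually.
    match_index: highest replicated index per follower; the leader
    implicitly counts itself as one replica."""
    for n in range(len(log), commit_index, -1):  # try the highest index first
        if log[n - 1][0] != current_term:
            continue  # never commit old-term entries by counting replicas
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicas * 2 > cluster_size:
            return n  # committing N indirectly commits all earlier entries
    return commit_index

# The term-2 entry at index 2 is on a majority, but a term-4 leader may
# not commit it by counting replicas...
log = [(1, "x"), (2, "y")]
print(advance_commit_index(log, 4, 0, {"S2": 2, "S3": 2, "S4": 0}, 5))  # 0
# ...until an entry from its own term 4 reaches a majority.
log.append((4, "z"))
print(advance_commit_index(log, 4, 0, {"S2": 3, "S3": 3, "S4": 0}, 5))  # 3
```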

Point to ponder

Question

Can you think of a situation where it is safe to conclude that an older log entry is committed just by counting replicas?

Answer

There might be circumstances where a leader can safely deduce that an older log entry is committed, such as when it is stored on every server (the number of replicas equals the number of servers). However, Raft follows a more conservative approach for simplicity.

Rationale: Other consensus algorithms require a new leader to renumber entries from previous terms with its new term number before replicating them. In contrast, Raft keeps the original term number of each log entry, even when a leader replicates entries from previous terms. This makes it easier to reason about log entries, since an entry keeps the same term number over time and across logs. Furthermore, new leaders in Raft transmit fewer log entries from previous terms than algorithms that must retransmit redundant entries to renumber them before they can be committed.

Safety argument#

The argument that Raft ensures safety rests on the Leader Completeness Property: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all subsequent terms.

The following is a proof sketch showing that this property holds.

We can demonstrate the validity of the Leader Completeness Property in the complete Raft algorithm through a proof by contradiction. We assume that the Leader Completeness Property does not hold and derive a contradiction. Here’s how:

Suppose that the leader of term T (leader_T) commits a log entry from its term, but the leader (leader_U) of some future term U (the smallest term U > T) does not store that entry.

  1. The leader_U's log must not contain the committed entry at election time (leaders never delete or overwrite entries in their own logs).

  2. The leader_T replicated the committed entry on a majority of the cluster, and leader_U received votes from a majority of the cluster. Therefore, there must be at least one server, the voter, that both accepted the entry from leader_T and voted for leader_U.

    If S5 is chosen as leader for a later term U after S1 (the leader for term T) commits a new log entry from its term, then at least one server (S3) must have both accepted the log entry and voted for S5.

  3. The voter accepted the committed entry from leader_T before voting for leader_U.

  4. The voter still had the entry stored in its log when it voted for leader_U, because every intervening leader contained the entry.

  5. The voter granted its vote to leader_U, which means that leader_U's log must have been at least as up-to-date as the voter’s log. This leads to one of two contradictions.

  6. If the voter and leader_U shared the same last log term, then leader_U's log must have been at least as long as the voter’s, so it contained every entry in the voter’s log. This is a contradiction, because the voter already had the committed entry in its log, and the entry was assumed to be absent from leader_U's log.

  7. Otherwise, the last log term of leader_U must have been larger than the voter’s last log term. Moreover, it was greater than T, since the voter’s last log term was at least T. The earlier leader that created leader_U's last log entry must have contained the committed entry in its log. Then, based on the Log Matching Property, leader_U's log must also contain the committed entry, which is again a contradiction.

  8. Therefore, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

  9. The Log Matching Property guarantees that future leaders will also contain indirectly committed entries.

In conclusion, the Leader Completeness Property holds: assuming that it does not hold, and that the committed entry from leader_T is missing from the leader of some future term, leads to a contradiction.

Follower and candidate crashes#

Up to this point, we have focused on handling leader failures. Follower and candidate crashes are simpler to handle than leader crashes, and the Raft algorithm deals with both in the same way. If a follower or candidate crashes, any future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; once the crashed server restarts, the RPC will succeed. If a server crashes after executing an RPC but before responding, it will receive the same RPC again after it restarts. Since Raft’s RPCs are idempotent, the repeated RPC causes no harm. For instance, when a follower receives an AppendEntries request that includes log entries already in its log, it ignores those entries in the new request.

Unsuccessful AppendEntries and RequestVote RPC requests to a crashed candidate or follower
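The idempotent handling of a duplicate AppendEntries RPC can be sketched as follows. Field names loosely follow the Raft paper (`prevLogIndex`, `prevLogTerm`); everything else is an illustrative assumption:

```python
# Sketch of a follower's AppendEntries handler: entries already present
# are skipped, and the log is truncated only on a genuine conflict, so
# a retried (duplicate) RPC leaves the log unchanged.

def append_entries(log, prev_index, prev_term, entries):
    """log: list of (term, command). Returns False if the consistency
    check fails, True after applying `entries` idempotently."""
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False  # leader will retry with an earlier prev_index
    for i, entry in enumerate(entries):
        idx = prev_index + i  # 0-based position this entry occupies
        if idx < len(log):
            if log[idx][0] == entry[0]:
                continue  # already stored: a duplicate RPC causes no harm
            del log[idx:]  # conflict: discard this and all later entries
        log.append(entry)
    return True

follower = [(1, "a")]
append_entries(follower, 1, 1, [(1, "b")])
append_entries(follower, 1, 1, [(1, "b")])  # duplicate delivery
print(follower)  # [(1, 'a'), (1, 'b')], unchanged by the retry
```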

Timing and availability#

Raft’s safety must not depend on timing: the system must never produce incorrect results just because events happen faster or slower than expected. However, the system’s availability, which measures the system’s ability to respond to clients in a timely fashion, inevitably depends on timing. For instance, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election. Without a stable leader, the system cannot make progress.

Leader election is a critical aspect of Raft that relies heavily on timing. To elect and maintain a steady leader, the system must meet the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

Here:

  • The average time a server takes to send RPCs in parallel to every other server in the cluster and receive their responses is called broadcastTime.
  • The election timeout, as described in Raft’s Leader Election Protocol, is known as the electionTimeout.
  • MTBF refers to the average time between failures for a single server.

The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages that prevent followers from starting elections. Given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout, in turn, should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system remains unavailable for roughly the election timeout, which represents only a small fraction of the overall time.

While the broadcast time and MTBF are properties of the underlying system, the election timeout is a variable that we need to choose appropriately. Raft’s RPCs typically require the recipient to store information in stable storage, so depending upon the storage technology, the broadcast time may range from 0.5ms to 20ms. Consequently, the election timeout will likely be between 10ms and 500ms. The typical server MTBFs are several months or more, comfortably fulfilling the timing requirement.
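The timing requirement can be illustrated with a randomized timeout picker. The constants and the 150–300 ms window below are assumptions for illustration, consistent with the ranges quoted in the text, not measured values:

```python
# Sketch of choosing a randomized election timeout satisfying
# broadcastTime << electionTimeout << MTBF.
import random

BROADCAST_TIME_MS = 20                # assumed worst-case RPC round trip
MTBF_MS = 30 * 24 * 3600 * 1000      # assumed MTBF of roughly one month

def pick_election_timeout(low_ms=150, high_ms=300):
    """Each server picks its timeout uniformly at random from a fixed
    window; the spread keeps split votes rare."""
    timeout = random.uniform(low_ms, high_ms)
    # Sanity-check the ordering the protocol relies on:
    # an order of magnitude above broadcast time, and far below MTBF.
    assert BROADCAST_TIME_MS * 5 <= timeout and timeout * 100 <= MTBF_MS
    return timeout

print(pick_election_timeout())  # some value in [150, 300]
```

Because each server draws its own timeout independently, the server with the shortest timeout usually starts (and wins) the election alone, before any other timeout fires.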

In the next lesson, we’ll discuss how Raft implements cluster membership changes in its algorithm.
